Red Wine Prediction by Jade Crump

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Univariate Plots Section

I started with quality (the output variable) to understand the overall distribution of wines. The univariate graph reveals a roughly normal distribution of wine quality, with a minimum score of 3, a mean of 5.6, and a maximum score of 8. I also created a factor to group the wines as Poor (0-4), Average (5-6), and Good (7-10).
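A grouping like this could be built with base R's cut(). The following is a minimal sketch, not the code actually used for this report: it uses only the first ten quality scores shown in the str() output above, and the name quality_group is an assumption made for illustration.

```r
# First ten quality scores, copied from the str() output above.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# With include.lowest = TRUE the intervals are [0,4], (4,6], (6,10],
# which for integer scores means Poor = 0-4, Average = 5-6, Good = 7-10.
quality_group <- cut(quality,
                     breaks = c(0, 4, 6, 10),
                     labels = c("Poor", "Average", "Good"),
                     include.lowest = TRUE)

# Count of wines in each group (hypothetical variable name).
print(table(quality_group))
```

On these ten rows the Poor group is empty, which already hints at how few low-scoring wines the data contains.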

Next, I shifted to analyzing the input variables, starting with alcohol content. This graph reveals a right-skewed distribution, with the mean and median both close to 10% alcohol by volume (10.42 and 10.20, respectively).

## Warning: position_stack requires constant width: output may be incorrect

Then I looked at the potassium sulphate content (which contributes to SO2). After adjusting the binwidth to a level appropriate for the variable, the distribution appeared slightly right-skewed, with some outliers extending beyond 1 all the way out to 2.0. The mean and median are 0.658 and 0.620, but the maximum is 2.0, which produces the long right tail.

Two other related variables, free sulfur dioxide and total sulfur dioxide (SO2), follow the same right-skewed distribution after adjusting the binwidths to suit each variable. This makes sense: logically, we would expect these variables to be correlated and thus to follow a similar distribution.

Next was pH, which measures acidity on a scale from 0 (very acidic) to 14 (very basic). The distribution of our wines appears to be roughly normal, with the middle 50% of our sample wines (1st to 3rd quartile) falling between 3.21 and 3.40 on the pH scale.

## Warning: position_stack requires constant width: output may be incorrect

Then I looked at two other variables related to pH: Fixed Acidity and Volatile Acidity. At first glance, both appear slightly right-skewed, with a shape broadly similar to pH apart from the longer right tail. When I adjusted the binwidth of Volatile Acidity further, I revealed a slight bi-modal distribution with peaks around 0.4 and 0.65.

Then I looked at citric acid content, which ranges in our dataset from 0 to 1, with a right-skewed distribution. It appears that 0 is actually our most common value, with other peaks around 0.25 and 0.5. The mean and median are 0.271 and 0.260, respectively, with a max of 1.0.

Then I looked at residual sugar. It’s a pretty clear right-skewed distribution, with a mean of 2.54 but a max of 15.50. The middle 50% of the data falls between 1.90 and 2.60; it is the outliers that produce such a strong right skew.
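The pull that a few large values exert on the mean can be checked numerically. This is a small sketch using only the first ten residual.sugar values from the str() output above, so the numbers are illustrative and not the full-dataset figures; the variable names are assumptions.

```r
# First ten residual.sugar values, copied from the str() output above.
sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Gap between mean and median with the large value (6.1) included.
gap_all <- mean(sugar) - median(sugar)

# Same gap after dropping the single largest value.
sugar_trimmed <- sugar[sugar < max(sugar)]
gap_trimmed <- mean(sugar_trimmed) - median(sugar_trimmed)

print(c(with_outlier = gap_all, without_outlier = gap_trimmed))
```

The median barely moves when the largest value is removed, while the mean drops noticeably, which is why the median is the steadier summary for a right-skewed variable.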

## Warning: position_stack requires constant width: output may be incorrect

Finally I looked at the chlorides (amount of salt in the wine). Chlorides looked very similar to residual sugar, with a strong right skew caused by some clear outliers. The middle 50% of the data fell between 0.070 and 0.090 (median of 0.079), but the maximum value is 0.611. The max and other outliers pull the mean up to 0.087, very close to the 3rd quartile value. Because of the strong right skew, I changed the graph to plot log10 of chlorides, which brings the distribution much closer to normal.
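To see why a log10 scale helps here, one can compare how lopsided the tails are before and after the transform. The sketch below uses only the min, median, and max of chlorides from the summary() output above; tail_ratio is a hypothetical helper written for this illustration. In ggplot2 the same effect on a plot axis is usually obtained with scale_x_log10(), though the exact call used for this report is not shown.

```r
# Landmarks for chlorides, copied from the summary() output above.
chl <- c(min = 0.012, median = 0.079, max = 0.611)

# Ratio of upper tail length to lower tail length (1 means symmetric).
tail_ratio <- function(v) {
  (v[["max"]] - v[["median"]]) / (v[["median"]] - v[["min"]])
}

raw_ratio <- tail_ratio(chl)         # raw scale
log_ratio <- tail_ratio(log10(chl))  # log10 scale

print(c(raw = raw_ratio, log10 = log_ratio))
```

On the raw scale the upper tail is several times longer than the lower one; on the log10 scale the two are nearly equal.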

Univariate Analysis

What is the structure of your dataset?

The Red Wine dataset contains 1,599 red wines and 13 variables: a row index (X), 11 attributes describing the chemical properties of the wine, and the resulting quality. The quality output comes from the median rating of at least 3 wine experts, with values from 0 (very bad) to 10 (very excellent).

What is/are the main feature(s) of interest in your dataset?

I am interested to see how different features impact the quality rating of red wines. In particular, I am eager to see whether and how volatile acidity and citric acid levels can help predict red wine quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I believe alcohol, pH, fixed acidity, chlorides, residual sugar and sulfur dioxide levels could also have an effect on wine quality. I think acidity levels are likely the main indicator, but I will evaluate the impacts of these other features as well.

Did you create any new variables from existing variables in the dataset?

I created a factor for quality. Instead of the values 3, 4, 5, etc., I created a grouping in order to look at wine quality more holistically as Poor (0-4), Average (5-6), and Good (7-10). Later, in my multivariate analyses, I created a second factor for quality, Below Average (0-5) and Above Average (6-10), in order to better delineate wine behavior.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I plotted chlorides on a log10 scale, which made its strongly right-skewed distribution look approximately normal.

Bivariate Plots Section

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: position_stack requires constant width: output may be incorrect

In order to get a full understanding of the relationship among the data set’s variables in a single view, I created a scatterplot matrix. From this view, there are a few highlights that caught my eye:

Citric Acid - There is a negative correlation with pH and positive correlations with sulphates and density

Volatile Acidity - There is a negative correlation between volatile acidity and quality as well as citric acid.

Alcohol - There is a positive correlation between alcohol and quality, and negative correlations with density, chlorides, and volatile acidity

Sulphates - Negative correlation with chlorides

Quality - The key output variable has a positive correlation with alcohol and sulphates, and a negative correlation with volatile acidity and Total SO2

The highest correlation in this data set was between pH and fixed acidity
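The correlation coefficients behind a scatterplot matrix can also be read directly from cor(). The sketch below is only an illustration on the first ten rows shown in the str() output above (four columns chosen for brevity), so the values will differ from the full-dataset correlations discussed here; wine_head and cor_matrix are assumed names.

```r
# First ten rows of four columns, copied from the str() output above.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Pearson correlation between every pair of columns.
cor_matrix <- cor(wine_head)
print(round(cor_matrix, 2))

# A single pair of interest can be pulled out by name.
print(cor_matrix["pH", "fixed.acidity"])
```

Even on this tiny sample the pH and fixed acidity coefficient is negative: more fixed acid goes with a lower pH.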

Next I started to look further into the various relationships between variables.

I looked at citric acid against pH. I added a line to the scatterplot in order to view the median pH over citric acid levels, revealing the negative correlation between the two variables (albeit not overwhelmingly strong).

Plotting citric acid against wine quality appears to yield no discernible relationship. I now turn to other variables in my dataset to help predict quality.

I next investigated volatile acidity. The relationship between volatile acidity and citric acid appears to be a negative correlation until the citric acid level reaches .5, at which point it appears to trend slightly positively.

Looking at the relationship between volatile acidity and quality, I can see the negative correlation. It appears that the higher quality wines tend to have a lower, more concentrated volatile acidity (very few outliers among the Good wines).

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.580  11.000 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

Looking at alcohol, I first analysed alcohol and volatile acidity. There doesn’t appear to be a strong relationship between the two. I then shifted focus to alcohol and wine quality. Based on this graph (with the median line added to the scatterplot), one can see the positive correlation between alcohol level and wine quality. A summary of alcohol content at each quality level backs up the graph with numbers: the median alcohol content of a wine with a quality level of 3 is 9.925, while that of a quality 8 wine is 12.15.
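A per-group median like the one quoted above can be computed with tapply(). This is a minimal sketch on the first ten alcohol and quality values from the str() output, so it only covers quality levels 5 to 7 and is not a reproduction of the full summary; the variable names are assumptions.

```r
# First ten rows, copied from the str() output above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Median alcohol content within each quality level.
median_alcohol <- tapply(alcohol, quality, median)
print(median_alcohol)
```

Swapping median for summary in the same call gives a six-number summary per quality level, similar in content to the output printed above.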

Next I looked at a few additional variables against quality that didn’t seem to lead to any strong relationships. I can see a slight positive correlation between quality and sulphates.

I can also notice a slight negative correlation between Total SO2 levels and wine quality.

Plotting quality against residual sugar yielded little insight, merely that low sugar levels are consistent across the various levels of quality.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

From my bivariate plots I was able to uncover different feature relationships and non-relationships within my data set. One of the key features I expected to affect quality was citric acid. My plot of citric acid and quality was disappointing in that it revealed no strong relationship.

However, volatile acidity did prove to have a visible negative correlation with quality.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The other interesting relationship I noticed was alcohol content and quality. Not expecting the alcohol content to be an indicator of quality, I was surprised to see a very clear positive relationship between the two features. Looking at the median alcohol content at each quality level (rather than mean in order to mitigate any outlier impact), one can see a large difference: 9.9 for lowest quality up to 12.2 for the highest.

What was the strongest relationship you found?

Judging purely by correlation, the strongest relationship in this data set is between pH and fixed acidity, which is very much expected. As pH is a measure of the acidity in wine, and fixed acidity is a part of that, the correlation supports our general knowledge and assumptions.

Multivariate Plots Section

## Warning: Removed 15 rows containing missing values (stat_smooth).
## Warning: Removed 14 rows containing missing values (stat_smooth).
## Warning: Removed 34 rows containing missing values (geom_point).

It took countless hours to reach this point, but it was during the multivariate analysis section that I decided to revisit my quality factor set. Instead of three levels, I tried to simplify to show Below Average (0-5) and Above Average (6-10) quality wines. After creating this feature and regraphing my subsequent plots, the relationship between variables became so much more clear.

Starting with alcohol content and volatile acidity, I was able to graph the relationship of those two features with the new quality score. In an attempt to eliminate the effect of outliers, I looked only at the bottom 99% of data points and am able to see a relatively clear delineation between the above and below average wines. The wines with a lower alcohol content and slightly higher volatile acidity seem to rate lower in quality, while the wines with a higher alcohol content and lower volatile acidity appear to rate higher in quality.
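One way to restrict a plot to the bottom 99% of each variable is to cut at the 99th percentile with quantile(). The sketch below is an assumption about how that could be done, shown on the first ten rows from the str() output; the names wine_head, alcohol_cap, acidity_cap, and trimmed are hypothetical, and the report's own plotting code is not shown.

```r
# First ten rows of the two plotted columns, copied from the str() output above.
wine_head <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
)

# 99th percentile of each variable.
alcohol_cap <- quantile(wine_head$alcohol, 0.99)
acidity_cap <- quantile(wine_head$volatile.acidity, 0.99)

# Keep only rows at or below both cutoffs.
trimmed <- subset(wine_head,
                  alcohol <= alcohol_cap & volatile.acidity <= acidity_cap)
print(nrow(trimmed))
```

With only ten rows each cutoff removes exactly one row; on the full data set the same cutoffs would drop roughly the top 1% of each variable.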

## Warning: Removed 32 rows containing missing values (geom_point).

Next I plotted the relationship between alcohol and sulphates with a color layer for wine quality. Once again, rather clear behavior can be seen for Below Average wines vs. Above Average wines. One can see a clear cluster for the Below Average wine (low alcohol content, lower sulphates) vs. the Above Average wine (with higher alcohol content and slightly higher sulphates).

## Warning: Removed 34 rows containing missing values (geom_point).

When graphing the relationship between volatile acidity and sulphates, with the quality overlay, once again a clear difference in behavior can be seen between the Below Average and Above Average wines. In this case, the Below Average wines tend to have higher levels of volatile acidity and just slightly lower levels of sulphates, whereas the Above Average wines have a lower volatile acidity and slightly higher sulphate levels.

## Warning: Removed 32 rows containing missing values (geom_point).

## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## Warning: Removed 20 rows containing missing values (stat_smooth).
## Warning: Removed 12 rows containing missing values (stat_smooth).
## Warning: Removed 32 rows containing missing values (geom_point).

Then I wanted to look at the ratio of volatile acidity to fixed acidity against density, and the impact on wine quality. From the first plot, it appears the Below Average wines have a higher ratio (volatile:fixed), with Above Average wines having a lower ratio, which further supports the previous graph showing simple volatile acidity. The density doesn’t appear to vary greatly between the quality groups. Once I added a smooth line for the median, it became clearer that, holding the acidity ratio constant, the Above Average wines have a lower density than the Below Average wines.
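The acidity ratio and the two-level quality grouping can both be derived in a couple of lines. This is a sketch under assumptions: it uses the first ten rows from the str() output, and the column names acid_ratio and quality_level are made up for the example.

```r
# First ten rows of the relevant columns, copied from the str() output above.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Ratio of volatile to fixed acidity (hypothetical column name).
wine_head$acid_ratio <- wine_head$volatile.acidity / wine_head$fixed.acidity

# Two-level grouping: Below Average = 0-5, Above Average = 6-10.
wine_head$quality_level <- cut(wine_head$quality,
                               breaks = c(0, 5, 10),
                               labels = c("Below Average", "Above Average"),
                               include.lowest = TRUE)

# Median ratio within each group.
ratio_median <- tapply(wine_head$acid_ratio, wine_head$quality_level, median)
print(round(ratio_median, 3))
```

Even in these ten rows the Below Average group has the higher median ratio, in line with the pattern described for the full plot.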

## Warning: Removed 13 rows containing missing values (stat_smooth).
## Warning: Removed 17 rows containing missing values (stat_smooth).
## Warning: Removed 31 rows containing missing values (geom_point).

For my final multivariate plot, I graphed the ratio of volatile acidity to fixed acidity against alcohol content, again overlaying the data with the wine quality. In this graph, we see that our Above Average quality wine has a higher alcohol content and lower acidity ratio, while the Below Average wine has a lower alcohol content and higher acidity ratio.

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wine)
## m2: lm(formula = quality ~ alcohol + sulphates, data = wine)
## m3: lm(formula = quality ~ alcohol + sulphates + total.sulfur.dioxide, 
##     data = wine)
## m4: lm(formula = quality ~ alcohol + sulphates + total.sulfur.dioxide + 
##     chlorides, data = wine)
## m5: lm(formula = quality ~ alcohol + sulphates + total.sulfur.dioxide + 
##     chlorides + volatile.acidity + volatile.acidity:fixed.acidity, 
##     data = wine)
## 
## ===================================================================================
##                                      m1        m2        m3        m4        m5    
## -----------------------------------------------------------------------------------
## (Intercept)                        1.875***  1.375***  1.650***  1.970***  2.942***
##                                   (0.175)   (0.177)   (0.185)   (0.192)   (0.206)  
## alcohol                            0.361***  0.346***  0.329***  0.302***  0.282***
##                                   (0.017)   (0.016)   (0.017)   (0.017)   (0.017)  
## sulphates                                    0.994***  1.025***  1.282***  0.892***
##                                             (0.102)   (0.102)   (0.110)   (0.111)  
## total.sulfur.dioxide                                  -0.003*** -0.003*** -0.002***
##                                                       (0.001)   (0.001)   (0.001)  
## chlorides                                                       -2.324*** -1.746***
##                                                                 (0.405)   (0.392)  
## volatile.acidity                                                          -1.438***
##                                                                           (0.166)  
## volatile.acidity x fixed.acidity                                           0.042*  
##                                                                           (0.019)  
## -----------------------------------------------------------------------------------
## R-squared                             0.227     0.270     0.280     0.295     0.353
## adj. R-squared                        0.226     0.269     0.279     0.293     0.351
## sigma                                 0.710     0.690     0.686     0.679     0.651
## F                                   468.267   294.988   207.177   166.754   145.058
## p                                     0.000     0.000     0.000     0.000     0.000
## Log-likelihood                    -1721.057 -1675.142 -1663.543 -1647.155 -1577.954
## Deviance                            805.870   760.894   749.934   734.719   673.799
## AIC                                3448.114  3358.284  3337.085  3306.310  3171.908
## BIC                                3464.245  3379.793  3363.971  3338.573  3214.925
## N                                  1599      1599      1599      1599      1599    
## ===================================================================================

I also attempted to create a linear model to predict wine quality, based on a set of features available in the data set. Starting with alcohol, I added sulphates, total SO2, chlorides, and finally volatile acidity together with its interaction with fixed acidity (the volatile.acidity:fixed.acidity term in m5, which is a product of the two rather than a ratio). Beginning with an R^2 value (a measure of goodness of fit) of .227, I was able to increase my R^2 value to .353. This value, unfortunately, does not indicate a strong fit (a strong fit would be close to 1.0). I did, however, test this model against my data set and, using a 95% confidence interval, was able to predict the correct quality value.
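For readers who want to try the fit-and-predict steps, here is a minimal sketch using the m1 formula (quality ~ alcohol) on the first ten rows from the str() output. It is not the model reported in the table: ten rows are far too few for a meaningful fit, the name m1_small is hypothetical, and the choice of interval = "prediction" is an assumption, since the kind of interval used in the report is not stated.

```r
# First ten rows, copied from the str() output above.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Same formula as m1 in the table above, fitted to ten rows only.
m1_small <- lm(quality ~ alcohol, data = wine_head)

# R^2: share of the variance in quality explained by the model.
r_squared <- summary(m1_small)$r.squared

# Point prediction plus a 95% interval for a wine with 10% alcohol.
pred <- predict(m1_small,
                newdata = data.frame(alcohol = 10),
                interval = "prediction",
                level = 0.95)

print(r_squared)
print(pred)
```

The predict() result has the columns fit, lwr, and upr, so checking a known wine amounts to asking whether its actual quality lies between lwr and upr.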

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

During my multivariate analysis, I was able to observe the impact of different feature relationships on the quality rating of red wines. Based on my bivariate analysis I was able to take some of the features that had appeared to impact quality and combine those with other features that appeared to have a correlation to each other.

As one example, alcohol and sulphates have a positive correlation both with each other and with quality. Plotting those two features together, along with wine quality, one can see that as both variables increase, the output (quality) generally increases as well. Holding sulphate content constant, quality is generally Above Average when alcohol content increases.

Were there any interesting or surprising interactions between features?

I think what surprised me the most was the overlap in my Above Average and Below Average quality wines. While clear distinctions could be seen in my feature graphs (sulphates vs. volatile acidity, alcohol vs. chlorides, etc.), there are still many wines that defy the general trends. This surprising discovery also impacted my attempt to create a model to predict wine quality (see below).

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I created a linear model using alcohol, sulphates, chlorides, total SO2, volatile acidity, and the interaction between volatile and fixed acidity.

This model only explains 35% of the variance in quality of red wines, which was disappointing. I think part of the trouble in predicting wine quality based on this set of features is that we’re attempting to use quantitative data to predict a subjective result. There are additional limitations in the data set, discussed in my reflection, that also hinder my attempt to create a proper model.


Final Plots and Summary

Plot One

Description One

The distributions of alcohol content for Below Average and Above Average wines are clearly distinct. Below Average wines peak at ~9.5% alcohol content with a tight distribution, while Above Average wines are more spread out and peak at ~11% alcohol content.

Plot Two

## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## Warning: Removed 43 rows containing missing values (stat_smooth).
## Warning: Removed 52 rows containing missing values (stat_smooth).
## Warning: Removed 99 rows containing missing values (geom_point).

Description Two

Looking at the relationship between pH and sulphates and the impact on red wine quality, there appears to be a distinction, although not a well-defined one. Holding pH constant, it appears Above Average wines have a higher sulphate content. The median smooth lines added to the graph allow one to see the distinction; however, the underlying points on the graph reveal a great overlap of the Above and Below Average wines.

Plot Three

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## Warning: Removed 6 rows containing missing values (stat_smooth).
## Warning: Removed 16 rows containing missing values (stat_smooth).
## Warning: Removed 7 rows containing missing values (stat_smooth).

Description Three

The distribution in the quality of red wines is clearly different based on the relationship of Volatile Acidity and Alcohol Content. Holding the alcohol content constant, as the volatile acidity level decreases, the median quality score increases. This indicates that a higher quality red wine has lower volatile acidity.

Reflection

The Red Wine data set contained details on ~1600 wines. I had to begin my analysis by understanding the various features within my data set, in order to understand their relationships with other variables as well as with quality. Once I understood the variables at play, I worked to graph and analyze the relationships in order to predict the quality output of wine. Eventually I created a linear model to do just that, including alcohol, volatile acidity and its interaction with fixed acidity, chlorides, sulphates, and total sulfur dioxide. However, I was disappointed in the low R^2 value that resulted. Based on my analyses I find that alcohol content and volatile acidity likely have the greatest impact on predicting wine quality. A higher alcohol level, coupled with a lower volatile acidity, appears to result in a higher quality wine.

There are many limitations to this data set. The wines included all come from the same region: Portugal. Because wine varies so much across the world, this limits any ability to extrapolate beyond Portugal. The data set also doesn’t include certain features of the wine that I believe (as a self-proclaimed wine connoisseur) would have a great impact on the wine quality: grape type, winery, region, and year. The third and likely greatest limitation is that the bulk of the wines in this data set fall within the Average quality range, with limited low and high quality wines. This unbalanced data set causes great difficulty in trying to understand and predict what inputs create an excellent wine. For any further analysis, I would love to be able to bulk up the existing data set with global wines and with the additional data elements that I feel are key to being able to predict Red Wine quality.